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Abstract 

The  researchers  made  significant  progress  in  all  of  the  proposed  research  areas.  The  first  major 
task  in  the  proposal  involved  risk-sensitive  control  and  estimation.  In  support  of  this  task,  the 
researchers  made  strides  toward  a  deeper  understanding  of  risk-sensitive  estimation  and  Markov 
chains. 

In  support  of  the  second  task,  the  researchers  made  progress  incorporating  simulation-based 
optimization  and  population- based  methods  into  optimization  problems.  They  made  significant  * 
progress  on  new  s  i  mu  1  at  ion- based  global  optimization  methods,  as  well  as  on  evolutionary  ap¬ 
proaches  to  solving  Markov  Decision  Processes  (MDPs},  new  sampling  methods  for  MDPs,  simulation- 
based  methods  for  MDPs,  new  approaches  to  the  allocation  of  simulation  replications  for  optimiza¬ 
tion,  and  applications  of  these  algorithms. 

In  support  of  the  third  major  task  that  involves  estimation  and  control  algorithms  for  graphical 
models  and  networked  systems,  the  researchers  made  progress  on  developing  scalable  algorithms 
for  inference  on  graphical  models.  In  particular,  they  developed  a  new  framework  for  distributed, 
dynamic  tracking  and  data  association  for  multiple  targets  from  multiple,  distributed  sensing  nodes. 
This  new  method  exploits  both  their  cominunications-sensitive  algorithm  and  their  method  of 
Nonparametric  Belief  Propagation. 
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1  Introduction 


In  this  research  project,  we  proposed  to  investigate  integrated  risk-sensitive,  simulation- based, 
population-based  and  graphical  methodologies  for  planning,  estimation,  and  control  that  can  be 
effective  tools  in  an  integrated  approach  to  Global  Awareness  (Intelligence,  Surveillance  and  Recon¬ 
naissance,  or  ISR)  and  Command  and  Control  (C2).  The  questions  we  investigated  were  motivated 
by  future  Air  Force  requirements,  which  will  involve  a  flexible  and  world-responsive  set  of  missions. 
Issues  that  arise  in  this  context  include  the  fact  that  information  for  both  training  and  operations 
may  arrive  just  in  time  (in  fact,  perhaps  even  as  forces  are  being  deployed),  requiring  a  much  more 
agile,  responsive,  and  integrated  ISR-C2  system.  A  key  idea  is  that,  rather  than  separate  ISR  and 
C2  planning  and  execution  cycles  and  collection  managers  dedicated  to  certain  assets,  there  should 
be  dynamic  feedback  between  the  ISR  and  C2  systems  (he.,  between  commanders  and  collection 
managers },  and  there  should  be  dynamic  allocation  of  sensing,  collection,  and  processing  assets. 

Such  systems  are  exceedingly  complex,  and  we  combined  four  approaches  in  the  study  of  such 
problems: 

•  Utilizing  risk-sensitive  cost  functions  to  achieve  robustness  and  incorporate  risk; 

•  Using  simulation  and  other  numerical  methods  for  sequential  decision  making  under  uncer¬ 
tainty; 

•  Developing  and  studying  efficient  simulation-based  and  sampling  methodologies  for  global 
optimization  problems; 

•  Utilizing  the  structure  of  the  system  (in  particular,  graphical  and  networked  models)  to  design 
scalable  fast  algorithms  for  planning,  estimation,  and  control 

Graphical  models  represent  a  powerful  framework  capable  of  capturing  spatio-temporal  and 
hierarchical/ multi-granularity  relationships  (cf.  the  Report  of  the  TYi-Servicc  Working  Group  on 
the  Role  of  Probability  and  Statistics  in  Command  and  Command  and  Control  [78]  (Principal 
Authors:  Prof.  Alan  S.  Whisky,  Prof.  Steven  L  Marcus,  and  Dr.  Wendy  Poston).  The  exploitation 
of  structure,  such  as  that  inherent  in  graphical  models,  is  essential  for  the  computations  involved 
in  estimation  and  planning  to  be  feasible.  It  has  been  our  intention  to  integrate  the  risk-sensitive, 
simulation- based,  population-based  and  computational  methods  discussed  below  with  work  on  the 
control  and  estimation  of  systems  described  via  graphical  models,  such  as  trees,  dynamic  Bayes 
networks,  networked  systems,  and  Petri  nets. 

Simulation  and  sampling  can  be  effective  tools  for  the  analysis,  design,  and  control  of  such 
systems  (e.g.,  Andradottir  1998,  Jacobson  and  Schruben  1989,  Fii  20Q2ab).  Even  those  problems 
that  can  in  principle  be  modeled  using  analytical  techniques  such  as  Markov  chains  may  lead  to 
computationally  intractable  models.  A  typical  example  of  this  is  a  large  queueing  network  with 
general  arrival  processes  and  service  time  distributions,  for  which  a  simulation  model  can  be  easily 
and  quickly  built. 

The  need  to  address  sequential  decision  making  under  uncertainty  has  led  to  the  recent  research 
focus  on  simulation- based  methods  for  solving  Markov  decision  processes  (MBPs),  which  provide 
a  useful  framework  for  formulating  these  types  of  problems  (e.g.,  Bertsekas  and  Tsitsiklis  1996, 
Sutton  and  Barto  1998).  Most  of  the  approaches  have  concentrated  on  approximating  the  value 
function,  in  effect  reducing  the  dimensionality  of  the  state  space  to  a  manageable  number  through 
a  suitable  parameterization  (e.g.,  Das  et  al.  1999,  Van  Roy  and  Tsitsiklis  2001).  The  function 
approximation  is  carried  out  via  a  number  of  different  techniques,  including  the  use  of  basis  functions 
and  neural  networks,  where  simulation  is  used  to  provide  samples  in  order  to  fit  curves.  The  key 
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idea  throughout  is  to  avoid  enumerating  the  entire  state  space.  The  approaches  studied  in  this 
research  are  meant  to  complement  these  highly  successful  techniques. 

The  goal  of  global  optimization  is  to  find  parameter  values  that  achieve  the  optimum  of  an 
objective  function.  In  general,  due  to  the  presence  of  multiple  local  optimal  solutions,  global 
optimization  problems  are  extremely  difficult  to  solve  exactly.  Solution  methods  for  both  continuous 
and  combinatorial  global  optimization  problems  can  be  categorized  as  being  either  instance- based  or 
model-based  (cf.  Ziochin  2004).  In  instance-based  methods,  the  searches  for  new  candidate  solutions 
depend  explicitly  on  previously  generated  solutions.  Some  well-known  approaches  are  simulated 
annealing  (SA)  (Kirkpatrick  1983),  genetic  algorithms  (GAs)  (Srinivas  1994),  tabu  search  (Glover 
1990),  and  the  recently  proposed  nested  partitions  (NP)  method  (Shi  2000), 

Model-based  search  methods  are  a  new  class  of  solution  techniques  introduced  in  recent  years.  In 
model-based  algorithms,  new  solutions  are  generated  via  an  intermediate  probabilistic  model  that 
is  updated  or  induced  from  the  previously  generated  solutions.  In  general,  most  of  the  algorithms 
that  fall  in  this  category  share  a  similar  framework  and  involving  the  following  two  phases: 

1.  Generate  candidate  solutions  (random  samples,  trajectories)  according  to  a  specified  proba¬ 
bilistic  model  (e.g.,  a  parameterized  probability  distribution  on  the  solution  space). 

2.  Update  the  probabilistic  model,  on  the  basis  of  the  data  collected  in  the  previous  step,  in 
order  to  bias  the  future  search  toward  “better”  solutions. 

Some  well  established  techniques  that  belong  to  the  modd-based  methods  are  the  cross-entropy 
(CE)  method  (Deboer  2005,  Manner  2003,  Rubinstein  1997,  Rubinstein  1999,  Rubinstein  2001,  Ru- 
binstein2Q04),  a  class  of  algorithms  called  the  estimation  of  distribution  algorithms  (EDAs)  (Lar- 
ranaga  1999,  Muhlenbeinl996,  Pelikan  1999),  and  the  so-called  annealing  adaptive  search  (A AS) 
(Shen  2005,  Zabinsky  2003).  Key  questions  in  model-based  methods  are:  (i)  how  to  efficiently 
update  the  probability  distributions,  and  (h)  how  to  efficiently  sample  from  the  probability  dis¬ 
tributions.  While  many  solution  techniques  have  been  proposed,  there  is  still  a  need  for  efficient 
algorithms  that  arc  based  on  a  precise  mathematical  framework,  are  provably  convergent,  converge 
to  an  optimal  solution  quickly,  are  easy  to  implement,  and  handle  both  continuous  and  combina¬ 
torial,  deterministic  and  stochastic  optimization  problems. 

Our  approach  has  been  based  on  the  following  key  points: 

•  Scalable  fast  algorithms  will  only  be  possible  if  one  can  exploit  the  inherent  structure  of  the 
system. 

•  New  and  efficient  approaches  for  incorporating  rigorous  simulation  and  statistical  methods 
are  required  for  the  solution  of  difficult  optimization  and  sequential  decision  making  problems. 

•  Often  it  is  much  easier  mid  more  efficient  to  simulate  complex  systems  than  to  model  them 
analytically. 

•  Risk-sensitive  objective  functions  are  an  effective  approach  for  incorporating  risk  and  achiev¬ 
ing  robustness. 


2  Simulation-based  and  Sampling  Methods  for  Global  Optimiza¬ 
tion 

We  have  considered  the  following  optimization  problem: 

x*  E  arg max# (x),  x  E  X  C  3f V\  (1) 
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where  the  solution  space  X  (which  can  be  either  continuous  or  discrete)  is  a  non-empty  subset  of 
K",  and  H{  )  :  X  — *  5ft  is  a  deterministic  function  that  is  bounded  from  below,  be,,  3  M  >  —oo 
such  that  H(x)  >  M  V  x  £  X.  Wc  assume  that  problem  (1)  has  a  unique  global  optimal  solution, 
he.,  there  exists  x*  £  X  such  that  H{x)  <  H{x *)  for  all  x  ^  x  £  AT  but  there  may  be  many 
local  optimal  solutions. 

In  Hu,  Marcus,  and  Fu  (2006b),  we  presented  a  general  model- based  global  optimization  frame¬ 
work  called  model  reference  adaptive  search  (MRAS).  The  motivation  behind  MR  AS  is  to  use  a 
sequence  of  intermediate  reference  distributions  to  facilitate  and  guide  the  updating  of  the  pa¬ 
rameters  associated  with  the  family  of  parameterized  distributions  during  the  search  process.  The 
sequence  of  reference  distributions  in  MRAS  are  selected  such  that  they  can  be  shown  to  converge 
to  a  degenerate  distribution  concentrated  only  on  the  set  of  optimal  solutions.  The  sequence  of  ref¬ 
erence  models  is  only  used  implicitly  to  guide  the  parameter  updating  procedure,  in  contrast  to  the 
usual  Estimation  of  Distribution  Algorithms  (EDAs),  where  the  distributions  must  be  constructed 
explicitly.  At  each  iteration  of  MRAS,  candidate  solutions  are  generated  from  the  distribution 
{among  the  prescribed  family  of  distributions)  that  possesses  the  minimum  Kullback-Leibler  (KL) 
divergence  with  respect  to  the  reference  model  corresponding  to  the  previous  iteration.  These 
candidate  solutions  are  in  turn  used  to  construct  the  next  distribution  that  has  the  minimum 
KL-divergcnce  with  respect  to  the  current  reference  model,  from  which  future  candidate  solutions 
will  be  generated.  For  a  class  of  parameterized  probability  distributions,  the  so-called  Natural 
Exponential  Family  (NEF),  the  algorithm  converges  to  an  optimal  solution  with  probability  one. 

To  explain  the  main  idea  behind  MRAS,  we  consider  the  following  naive  model-based  approach 
for  solving  (1).  Let  g$(x)  >  0  Vx  £  X  be  an  initial  probability  density/ mass  function  (p.d.f./p.m.f.) 
on  the  solution  space  AT  At  each  iteration  k  >  1,  we  compute  a  new  p.d.f.  by  tilting  the  old  p.d.f. 
^_i{x)  with  the  performance  function  H(x)  {for  simplicity,  here  we  assume  H(x)  >  0  Var  €  A"), 


ieM 


9k(x) 


#(s)gfc-i(g) 

ly  H{x)gk_i(dx)' 


Vi  €  X, 


(2) 


By  doing  so,  we  are  assigning  more  weight  to  the  solutions  that  have  better  performance.  One  direct 
consequence  of  this  is  that  each  iteration  of  (2)  improves  the  expected  performance.  To  be  precise, 
let  X  =  (AT, , Xn)  be  a  random  variable  taking  values  in  X.  To  reduce  the  notational  burden, 
henceforth  X  will  be  used  to  denote  a  random  variable  having  the  distribution  under  which  the 
expectation  is  indicated.  Thus,  E9k[H{X)\  ^  fx  H(x)g^{dx)  and  Egk  Jf/(X)]  —  fx  H(x)yk-\(dx). 
Then  we  have 


E3k\H(X)\  = 
> 


ESk-AH(X)\ 

Egk_x[H{X)). 


Furthermore,  it  is  possible  to  show  that  the  sequence  of  p.d.f.  *s  A:  =  0, 1, . . .}  will  converge 

to  a  p.d.f.  that  concentrates  only  on  the  set  of  optimal  solutions  for  arbitrary  £o(  )-  So  we  will 
have  lim^oo  E9k[H(X) }  =  H(x'). 

However,  the  above  approach  is  generally  of  little  practical  use,  due  to  the  following  reasons:  (i) 
It  is  usually  not  possible  to  enumerate  all  the  points  in  the  solution  space  in  order  to  perform  the 
update  (2);  furthermore,  if  it  were  possible,  the  optimal  solution  could  be  immediately  identified 
simply  by  checking  which  point  has  the  best  performance  value,  (ii)  The  p.d.f.  gk(%)  constructed 
at  each  iteration  may  not  have  any  structure,  and  therefore  may  be  very  difficult  to  handle. 

To  overcome  the  above  difficulties,  we  considered  in  Hu,  Fu,  and  Marcus  (2006b)  the  Monte 
Carlo  (sampling)  version  of  the  above  approach  and  at  the  same  time  restrict  ourselves  to  a  family 
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of  parameterized  p.d.f.’s/p.m.f.’s  {/(*,$)},  where  0  is  the  parameter  vector.  In  particular,  at  each 
iteration  A;  of  the  algorithm,  we  look  at  the  projection  of  g^(  )  on  the  family  of  p-d.f.’s/p.m.f.’s 
{/(■,#)}  and  compute  the  parameter  vector  that  minimizes  the  KiiUhack-Leibler  (KL)  divergence 


V(9kJ(;0))  :=Egk 


In 


Sk(X)  i  =  r 

f(X,9)  J  Ux 


In 


/(*.  0) 


(3) 


where  v  is  the  Lebesgue/counting  measure  defined  on  A"*  The  benefits  of  the  above  consideration 
are  twofold:  on  the  one  hand,  f{-,0k)  often  has  some  special  structure  and  therefore  could  be 
much  easier  to  handle  than  <?*(-)■  On  the  other  hand,  the  sequence  {/(-,0/J}  may  retain  some 
nice  properties  of  {<&(  )}  and  converge  to  a  degenerate  p.d.f*  concentrated  on  the  set  of  optimal 
solutions* 


2,1  The  MRASq  Algorithm  (Exact  Version) 

Let  P$k{')  and  Eek{ ]  denote  the  probability  and  expectation  taken  with  respect  to  the  p*d<f./p*m.f* 
and  let  }  denote  the  indicator  function,  he., 


hA) 


1  if  event  A  holds, 
0  otherwise. 


Thus,  in  this  notation, 

Pgk  (H(X)  >  7)  =  f  /(»(«)>,>/(*,  ek)u{dx\ 
E0k[H(X))=  f  H(x)f(x,0k)u(dx). 


Algorithm  Description 

The  MRASq  algorithm  requires  specification  of  a  parameter  p ,  which  determines  the  approximate 
proportion  of  samples  that  will  be  used  to  update  the  probabilistic  model*  At  successive  iterations 
of  the  algorithm,  a  sequence  {7*, A:  =  1,2,..,},  he*,  the  (1  -  p)~quantiles  with  respect  to  the 
sequence  of  p.d.f’s  are  calculated  at  step  1  of  MRASq.  These  quantile  values  arc  then 

used  in  step  2  to  construct  a  sequence  of  non-decreasing  thresholds  {7^,  k  =  1,2,...};  and  only  those 
candidate  solutions  that  have  performances  better  than  these  thresholds  will  be  used  in  parameter 
updating  (cf.  equation  (4).  The  theoretical  convergence  of  MRASq  is  unaffected  by  the  value  of  the 
parameter  p.  The  purpose  of  p  in  our  approach  is  to  concentrate  the  computational  effort  on  the  set 
of  elite/promising  samples,  which  is  a  standard  technique  employed  in  most  of  the  population-based 
approaches,  like  GAs  and  ED  As* 

During  the  initialization  step  of  MRASq,  a  small  number  e  and  a  continuous  and  strictly  in¬ 
creasing  function  S(0  :  5R  -+  9?+  are  also  specified*  The  function  £(■)  is  used  to  account  for  the 
cases  where  the  values  of  H{x)  are  negative  for  some  x,  and  the  parameter  e  ensures  that  each 
strict  increment  in  the  sequence  {7*}  is  lower  bounded,  i*e*, 

inf  (7fc+i  -%)  >  e. 

We  require  e  to  be  strictly  positive  for  continuous  problems,  and  non- negative  for  discrete  problems* 
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Algorithm  MRASoi  Model  Reference  Adaptive  Search  -  exact  version 

*  Initialization;  Specify  the  parameter  p  €  (0,  1],  a  small  number  e  >  0,  a 
continuous  and  strictly  increasing  function  S(  )  :  5?  — *  3?+s  and  an  initial 
pdX/p.m.f.  f(x,Qo)  >  0  Vt  €  X.  Set  the  iteration  counter  k  —  0. 

*  Repeat  until  a  specified  stopping  rule  is  satisfied: 

1.  Calculate  the  (1  -  p)-quantile 

7 fc+i  sup  {f  :  Pek  (H(X)  >  /)  >  p}  . 


2.  if  k  =  0,  then  set  yk+i  ~  7*+i- 

elsetf  k  >  1 

if  7fc+i  >  7*  +  £,  then  set  %+i  —  7*+]* 
else  set  7fc+i  =  >■ 
end  if 


endif 

3.  Compute  the  parameter  vector  as 


0jt+i  ;=  arg  max 


EBk 


\smx)t 

nx,ok) 


(4) 


4,  Set  k  =  k  +  L 


In  continuous  domains,  the  division  by  f(x,6 *)  in  the  performance  function  in  step  3  is  well  de¬ 
fined  if  /(x,  &k)  has  infinite  support  (e.g.  normal  p.d.f.),  whereas  in  discrete/eombinatorial  domains, 
the  division  is  still  valid  as  long  as  each  point  x  in  the  solution  space  has  a  positive  probability  of 
being  sampled.  Additional  regularity  conditions  on  f(x}$k)  ensure  that  step  3  of  MRASn  can  be 
used  interchangeably  with  the  following  equation: 

0k+i  =  argmax  f  [S(H(x))]k  l{H(x)>yk+t)  ln/{x, 0)u(dx). 

The  following  lemma  shows  that  there  is  a  sequence  of  reference  models  {#*{')>  k  =  1,2, .. .} 
implicit  in  MRASq*  and  the  parameter  computed  at  step  3  indeed  minimizes  the  KL-divergenec 
p(flfc+i, /(-,«))• 

Lemma  The  parameter  0/^j  computed  at  the  A;th  iteration  of  the  MRASq  algorithm  minimizes 
the  KL-divergence  T>  (5^+1,  /(',#))*  where 


9k+l(x) 


E,t  [s(HOT)/,„m>Wl)] 


k  =  1,2,. 


9i(z)  ■= 


Ee0\ 

[  f(*M  J 

2.2  Global  Convergence 

Global  convergence  of  the  MRASo  algorithm  clearly  depends  on  the  choice  of  parameterized  dis¬ 
tribution  family.  The  algorithm  may  not  be  computationally  tractable  for  some  choices.  In  Hu, 
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Fu,  and  Marcus  ( 2006b) ,  we  have  utilized  a  particular  family  of  p.dX’s/p.m.f.’s  called  the  natural 
exponential  family  (NEF),  for  which  the  global  convergence  properties  can  be  established. 

Definition,  A  parameterized  family  of  p.d.Fs  {/(  -,0),  0  e  0  C  Sm}  on  A1  is  said  to  belong  to  the 
natural  exponential  family  (NEF)  if  there  exist  functions  /i{-)  :  Sin  — *  3?,  F(-)  :  3Rn  — ►  and 
K(  )  :  3?™  — *  5R  such  that 

f{x,  0)  -  exp  {eTr(x)  -  I<(0)}  h(x),  W  e  6,  (5) 

where  K(8)  =  In exp  h(x)dx,  and  the  superscript  “T”  denotes  the  vector  trails- 

position.  Many  common  p.d.f.’s/p.m.f.’s  belong  to  the  NEF,  e.g.t  Gaussian,  Poisson,  binomial, 

geometric,  and  certain  multivariate  forms  of  them.  Some  regularity  conditions  are  needed  to  prove 
convergence  of  the  algorithm. 

Assumptions: 

Al.  There  exists  a  compact  set  n  C  A  such  that  the  level  set  {x  :  H(x)  >71}  C  Or  where  71  = 
supf{/  :  P&q(H( X)  >  l)  >  p}  ts  defined  as  in  the  MR  A  So  algorithm . 

A2,  For  any  given  constant  £  <  //(x*),  the  set  {x  :  H(x)  >  £}  has  a  strictly  positive  Lebesgue 
measure. 

A3.  For  any  given  constant  5  >  0,  supx€ilj  H(x)  <  H{x*),  where  As  :=  {x  :  ||x  —  x*||  >  £}♦ 

A4.  The  maximizer  of  equation  (4)  is  an  interior  point  ofB  for  all  k . 

A5.  svip^e0  ||exp{67  r(x}}F{x)h(x)|j  is  integrablefsummable  with  respect  to  x f  where  $,  F()>  and 
h(  )  are  defined  as  above . 

A6.  F(“)  :  5Rm  — *  5ftn  given  above  is  a  continuous  mapping . 

Remark  1:  Assumptions  A1—A3  are  regularity  conditions  imposed  on  the  optimization  problem 
to  be  solved,  whereas  assumptions  A4-AG  are  restrictions  imposed  on  the  parameterized  family 
of  p.d.f.’s.  Al  is  satisfied  if  the  function  H(  )  has  compact  level  sets  or  the  solution  space  A  is 
compact.  Intuitively,  assumption  A2  ensures  that  any  neighborhood  of  the  optimal  solution  x* 
will  have  a  positive  probability  of  being  sampled;  it  is  satisfied  if  the  objective  function  H(+)  is 
continuous  at  x*.  Since  H(  )  has  a  unique  global  optimizer,  A3  is  satisfied  by  many  functions 
encountered  in  practice,  and  is  guaranteed  to  hold  if  A  itself  is  compact.  In  actual  implementation 
of  the  algorithm,  step  3  of  MRASo  is  often  posed  as  an  unconstrained  optimization  problem,  i.e,, 
0  =  in  which  case  A4  is  automatically  satisfied.  It  is  also  easy  to  verify  that  A5  and  AG  are 
satisfied  by  most  NEFs. 

Theorem:  Let  k  =  1,2, _ }  be  the  sequence  of  parameters  generated  by  MRASo-  If  £  >  0 

and  assumptions  Al— A6  are  satisfied,  then 

lira  E0k  [F(X)]  =  F(x*).  (6) 

k— *00 


Remark  2:  Notice  that  when  F£x)  is  a  one-to-one  function  {which  is  the  case  for  many  NEFs  used  in 
practice),  the  convergence  result  (G)  can  be  equivalently  written  as  F_1  (lim*^oc  ^0*  |F(X)])  =  x*. 
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Also  note  that  the  limit  in  equation  (6)  is  component-wise.  For  some  particular  p.d.f.’s/p.m.f.’s, 
the  solution  vector  x  itself  will  be  a  component  of  F(x)  (e.g.,  multivariate  normal  distribution)* 
Under  these  circumstances,  we  can  disregard  the  redundant  components  and  interpret  equation  (6) 
as  Hindoo  i^\  “  x* '  Another  special  case  of  particular  interest  is  when  the  components  of  the 
random  vector  X  =  (AT . . . ,  Xn)  are  independent,  Le.,  each  has  a  univariate  p.dT  of  the  form 

=  exp (xi$i  -  K(tii))h(x €  R,V  i  =  1,. 

In  this  case,  since  the  p.d.f*  of  the  random  vector  X  is  simply  the  product  of  the  marginal  p.d.L’s, 
we  will  clearly  have  F(x)  =  i,  Thus,  equation  (6)  is  again  equivalent  to  Hindoo  E#k  [ X ]  —  x \ 
where  fl*  ■=  (tffi *  —  and  is  the  value  of  at  the  kth  iteration. 


2*3  Monte  Carlo  Algorithm 


The  MRASq  algorithm  describes  the  idealized  situation  where  quantile  values  and  expectations  can 
be  evaluated  exactly.  In  practice,  we  will  usually  resort  to  its  stochastic  {sampled,  or  simulation- 
based)  counterparts  where  only  a  finite  number  of  samples  are  used  and  expected  values  are  replaced 
with  their  corresponding  sample  averages.  For  example,  step  3  of  MRASq  will  be  replaced  with 


1  N 

0fc+1=argm^-g 


[5(H(Xi))3fc 

f(XiA) 


(7) 


where  are  i;i,d*  random  samples  from  /(x, 0fc)i  0*  is  the  estimated  parameter  vector 

computed  at  the  previous  iteration,  and  7^+1  is  a  threshold  determined  by  the  sample  (1  -  p)- 
quantile  of  H{X\),  *  ♦  * ,  H(Xpj). 

However,  the  theoretical  convergence  can  no  longer  be  guaranteed  for  a  simple  stochastic  coun¬ 
terpart  of  MRASq.  In  particular,  the  set  {x  :  H(x)  >  7^+1}  involved  in  equation  (7)  may  be  empty, 
since  all  the  random  samples  generated  at  the  current  iteration  may  be  much  worse  than  those 
generated  at  the  previous  iteration.  Thus,  we  can  only  expect  the  algorithm  to  converge  if  the 
expected  values  in  the  MRASq  algorithm  are  closely  approximated.  Obviously,  the  quality  of  the 
approximation  will  depend  on  the  number  of  samples  to  be  used  in  the  simulation,  but  it  is  difficult 
to  determine  in  advance  the  appropriate  number  of  samples*  A  sample  size  too  small  will  cause  the 
algorithm  to  fail  to  converge  and  result  in  poor  quality  solutions,  whereas  a  sample  size  too  large 
may  lead  to  high  computational  cost. 

As  mentioned  earlier,  the  parameter  p,  to  some  extent,  will  affect  the  performance  of  the 
algorithm.  Large  values  of  p  mean  that  almost  all  samples  generated,  whether  ‘"good11  or  “bad”, 
will  be  used  to  update  the  probabilistic  model,  which  could  slow  down  the  convergence  process. 
On  the  other  hand,  since  a  good  estimate  will  necessarily  require  a  reasonable  amount  of  valid 
samples,  the  quantity  pN  (i*e*,  the  approximate  amount  of  samples  that  will  be  used  in  parameter 
updating)  cannot  be  too  small.  Thus,  small  values  of  p  will  require  a  large  number  of  samples  to 
be  generated  at  each  iteration  and  may  result  in  significant  simulation  efforts* 

In  order  to  address  the  above  difficulties,  wc  adopted  in  Hu,  Fu,  and  Marcus  (2006b)  the  same 
idea  as  in  Homem-de-Mello  and  Rubinstein  (2003)  and  proposed  a  modified  Monte  Carlo  version 
of  MRASq  in  which  the  sample  size  N  is  adaptively  increasing  and  the  parameter  p  is  adaptively 
decreasing. 


Algorithm  Description  and  Convergence 

Roughly  speaking,  the  MRASi  algorithm  is  essentially  a  Monte  Carlo  version  of  MRASq  except 
that  the  parameter  p  and  the  sample  size  N  may  change  from  one  iteration  to  another.  The  rate 
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Algorithm  MRASo  Model  Reference  Adaptive  Search  -  Monte  Carlo 
version 

*  Initialization:  Specify  po  €  (0t  l]f  an  initial  sample  size  Nq  >  1,  £  >  0, 
a  >  lf  a  mixing  coefficient  A  €  (0, 1],  a  continuous  and  strictly  increasing 
function  S(  )  :  3f?  — *  3?+,  and  an  initial  p.d.f.  f(x,  do)  >  0  Vx  €  X.  Set 

■*—  k  *“  0. 

*  Repeat  until  a  specified  stopping  rule  is  satisfied: 

1.  Generate  Nk  U,d.  samples 

~ /(-,&)  :=  (1  -  >)/(-, ft.)+A/(vfc>- 

2.  Compute  the  sample  (l-pfc)-quantile  7±+t(PfcT  At)  :=  H(\{i-f>k)Nk\)* 
where  \a]  is  the  smallest  integer  greater  than  a,  and  is  the  ttb 
order  statistic  of  the  sequence  {  //  (X^  ),  i  =  1( . . , ,  JV*  } . 

3.  U  k  =  0  or  7fc+i (Pk,Nk)  >  7k  +  §,  then 

3a.  Set  7jt+]  *—  7jt+i  (pk<  Njt),  pjt+i  *“/>*♦  ATfc  +  i  +—  Nk, 
else,  find  the  largest  p  £  (0 ,pk)  such  that  7 k  +  i(p*Nk)  >  7*:  + 

3b.  If  such  a  p  exists, 

then  set  7*+i  *-  7fe+i(A  Nk),  pk+i  <—  ps  Nk+i  <—  Nk - 
3c.  else  (if  no  such  p  exists), 

set  7*+i  +“  7J t,  pk+i  <—  pfc,  Nfe+i  [aAfcl. 

endif 

4.  Compute  as 


Jv* 


(8) 


5,  Set  &  *—  A:  4-  1. 


of  increase  in  the  sample  size  is  controlled  by  an  extra  parameter  a  >  1,  specified  during  the 
initialization  step.  For  example,  if  the  initial  sample  size  is  Nq ,  then  after  k  increases,  the  sample 
size  will  be  approximately  [a*iVol- 

In  Hu,  Fu,  and  Marcus  (2Q(Xib),  finite  time  e-optimality,  with  probability  1,  of  this  Monte- 
Carlo  version  has  been  proved.  Numerical  studies  have  shown  that  the  algorithm  is  effective  on  a 
wide  range  of  problems,  including  continuous  problems  with  many  local  optima,  and  combinatorial 
problems  such  as  asymmetric  traveling  salesman  problems.  The  algorithm  has  also  performed  well 
on  problems  of  topology  configuration  in  Wave  Division  Multiplexed  (WDM)  optical  networks. 

2.4  Stochastic  Model  Reference  Adaptive  Control  (SMRAS) 

In  Hu,  Fu,  and  Marcus  (2006c),  we  have  extended  the  MR  AS  method  to  stochastic  optimization 
problems,  where  the  function  values  can  only  be  observed  in  the  presence  of  noise.  Denoting  H(x) 
as  the  random  observation  of  the  true  function  value  H{x)  made  at  point  x,  the  stochastic  version 
of  problem  (!)  can  be  formulated  as 

x*  £  argmaxf^f/fx)],  x  £  A*  C  Rn\  (9) 
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where  E( ■}  is  the  expectation  with  respect  to  the  probability  distribution  of  the  observation  noise. 
Since  an  unbiased  estimate  of  £[#{#)]  is 


1  M  - 


i= 1 


where  Hi(x)t  i  —  1, . . . ,  M  are  i.Ld,  observations  made  at  x,  it  would  be  natural  to  generalize  the 
performance  function  [5{//(x))]fr  in  MR  AS  to 

k 

Sk(H(x)):=US(Hi(x)).  (10) 

*=i 

Clearly  for  the  deterministic  case  (i.e.,  no  observation  noise)  we  will  have  the  original  performance 
function.  In  particular,  if  we  take  £(*)  to  be  an  exponential  function  (e.g<,  S(H(x))  =  ew^),  then 
equation  (10)  can  be  written  as 

k  ^ 

Sk(H(x))  :=  exp  (  £  ifi(x)). 

i=  1 

Therefore,  by  the  strong  law  of  large  numbers,  it  is  possible  to  show  that  MRAS  with  the  generalized 
performance  function  will  converge  w.p,  1  to  an  optimal  solution  of  (9).  However,  for  this  generalized 
performance  function,  we  need  to  keep  track  of  all  the  past  observations  made  at  all  points  that 
have  been  visited  thus  far,  which  could  be  computationally  difficult  to  handle  when  the  solution 
space  is  large  or  uncountable. 

A  major  modification  from  the  original  MRAS  method  is  in  the  way  the  sequence  of  reference 
distributions  is  constructed.  In  MRAS,  reference  distributions  are  idealized  probabilistic  models 
constructed  based  on  the  exact  performance  of  the  candidate  solutions.  In  the  stochastic  case, 
however,  the  objective  function  cannot  be  evaluated  deterministically,  so  the  sample  average  ap¬ 
proximations  of  the  (idealized)  reference  distributions  are  used  in  SMRAS  to  guide  the  parameter 
updating.  We  show  in  Hu,  Fu,  and  Marcus  (2006c)  that  for  the  Natural  Exponential  Family  (NEF), 
SMRAS  converges  with  probability  one  to  a  global  optimal  solution  for  both  stochastic  continuous 
and  discrete  problems.  To  the  best  of  our  knowledge,  SMRAS  is  the  first  model-based  search  method 
for  solving  general  stochastic  optimization  problems  with  provable  convergence.  The  algorithm  has 
been  shown  to  perform  efficiently  on  a  range  of  stochastic  problems,  including  problems  of  buffer 
allocation  and  inventory  control. 


3  Simulation-Based  and  Sampling  Methods  for  MDPs 


3.1  Efficient  Simulation  Allocation  via  Adaptive  Sampling 

The  basic  MDP  model  we  consider  in  this  section  is  specified  by  the  following  notation: 


Xi 

T 

S 

A 

fi(x,a,u>) 


Ri(x ,  a,  lj) 
Ai  e  A(Xi) 


state  in  period  i; 

time  horizon,  or  number  of  periods  (also  known  as  stages); 
state  space; 
action  space; 

transition  function  in  period  i  for  action  a  taken  in  state  x, 
where  u)  represents  the  stochastic  effects  (e.g.,  a  sample  path); 
one- period  reward  in  period  i  for  action  a  taken  in  state  x, 
action  taken  in  period  i. 
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Thus,  the  MDP  {Xt,i  =  0,1,. ..,T}  receives  reward  in  period  i  and  then  transitions 

according  to 

Xi=fi(Xi-l,Ai-u*)- 


The  objective  is  to  find  a  feedback  control  policy  {^(x)}^1  -  a  mapping  specifying  the  action 
taken  when  in  state  x  in  period  i  —  that  maximizes  an  expected  reward  function,  which,  for 
simplicity,  we  take  here  to  be  the  finite  horizon  discounted  total  reward:  (see  Arapostathis  et 
al.  1993): 


E 


T- 1 


(11) 


li=0  J 

where  a  is  the  (one-period)  discount  factor;  A  key  consideration  is  that,  simulation  is  required  for 
the  system  dynamics  (state  transitions)  and/or  period  rewards. 

We  define  some  familiar  quantities: 


Qi{x,  a)  =  (expected)  reward-to-go  (£?-value)  in  period  i  for  action  a  taken  in  state  x 
and  optimal  actions  taken  henceforth; 

Jt(x)  ~  optimal  value  function  in  period  i  for  state  x * 

Then  we  have  the  usual  Bellman  optimality  equation  (e*g*,  Puterman  1994,  Bertsckas  1995): 

Ji{x)  =  sup{F[/?i(i,o)  +  aJi+i(/i+i(*.a)))}  ,  (12) 

a 

written  here  in  two-part  form: 

Qi(x,a)  =  E[Ri(s,a)  +  aJk+i(fi+i{x,a))] ,  (13) 

Mx)  —  sup<2i{x,a),  (14) 

a 

where  for  notational  simplification,  we  henceforth  drop  explicit  display  of  An  optimal  policy  in 
period  i  will  be  denoted  by 

tv* (x)  €  argsup<5i(x,  a ),  i  —  0,  T  —  1,  x  €  S{i).  (15) 

a 

In  some  applications,  such  as  rolling-horizon  control  and  derivatives  pricing  problems  in  finance, 
the  goal  is  to  estimate  the  optimal  value  function,  t.e*,  Jq(xq)  for  a  particular  initial  value  xo,  rather 
than  the  entire  optimal  policy  If  sampling  is  required  to  estimate  the  expectations  involved,  then 
the  obvious  way  to  attack  the  Bellman  equation  given  by  (12),  or  (13)  and  (14)  is  simply  to  replace 
corresponding  expectation  quantities  with  their  sample  means.  However,  given  a  total  sampling 
budget,  there  is  the  question  of  how  the  budget  should  be  allocated,  both  in  terms  of  periods  and 
in  terms  of  actions* 

To  simplify  the  exposition  in  order  to  enhance  understanding  and  intuition,  wc  begin  by  placing 
some  additional  assumptions,  primarily  to  reduce  the  notational  burden*  Assume  that  A  is  discrete 
and  finite,  so  that  the  “sup”  operation  in  the  Bellman  optimality  equation  becomes  a  “max” 
operation  over  a  finite  set,  e.g.,  (12)  becomes 

Ji{x)  =  max  {E  [/£*(£, a)  +  a.li+\{fi+i(x>a))]} . 

a£A 

Again,  the  objective  is  to  efficiently  estimate  J*{x),  based  on  sample  paths  of  future  transitions  and 
rewards.  The  estimate*  along  with  the  “best  guess”  for  tt*(x),  is  based  on  sampling  over  the  actions 
a  £  A  in  period  L  In  other  words,  our  problem  is  how  to  carry  out  the  sampling  of  actions  from  a 
visited  state  of  a  certain  period  in  a  sample  path.  We  will  assume  that  we  are  given  a  fixed  N,  the 
total  number  of  samples  to  be  distributed  among  the  feasible  actions,  and  N  >  \A\,  so  that  each 
action  can  be  sampled  at  least  once*  Then  the  remaining  question  is  how  often  should  wc  sample 
each  of  the  actions?  To  summarize,  our  problem  is  as  follows: 
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■  How  should  the  sampling  budget  N  be  distributed  among  the  feasible  actions  (in  a  period}? 

The  simplest  “solution”  is  what  we  call  the  equal  non- adaptive  scheme,  in  which  the  sampling 
budget  is  distributed  equally  among  the  feasible  actions,  i.e*,  AT/|^4|  per  action.  So,  for  example, 
if  there  are  ton  possible  actions  and  the  sampling  budget  is  100,  then  each  action  would  be  sam¬ 
pled  ten  times  to  obtain  sample  transitions  and  rewards.  Clearly  this  is  generally  sub-optimal, 
and  our  research  is  predicated  on  the  assumption  that  this  “equal”  sampling  can  often  lead  to  a 
tremendous  waste  of  resources,  which  can  be  critical  when  the  computational  (sampling)  budget  is 
tight,  A  simple  illustration  of  this  arising  in  the  previous  example  is  when  nine  of  the  ten  actions 
yield  estimates  of  Q{x,a)  that  are  nearly  deterministic,  whereas  the  remaining  action  has  a  lot  of 
variability  (relative  to  all  of  the  other  actions).  Then,  in  general,  it  would  make  much  more  sense 
to  concentrate  most  of  the  sampling  on  the  one  with  the  high  variability*  Exceptions  to  this  occur 
when  the  sample  estimate  of  the  action  with  high  variability  is  worse  than  that  of  another  action 
by  an  amount  far  exceeding  that  for  which  the  variability  could  ever  compensate;  or  when  there  is 
a  benefit  attached  to  sampling  the  best  action  more  often. 

Our  approach  for  adaptive  sampling  in  estimating  the  value  function  of  an  MDP  is  based  on 
ideas  from  multi-armed  bandit  problems  (cf.  Gittins  1989,  Berry  and  FYistedt  1986).  The  objective 
of  these  problems  is  to  “play”  (select)  as  often  as  possible  the  “arm”  ,  which  we  will  call  a  machine 
henceforth,  that  yields  the  highest  (expected)  reward.  The  optimal  policy  must  balance  between 
playing  the  machine  that  is  empirically  best  thus  far  (exploitation)  —  i.e.,  it  has  the  highest 
sample  mean,  but  not  necessarily  the  highest  expectation  —  and  trying  to  find  a  better  machine 
(exploration),  Le*,  a  machine  that  actually  has  a  higher  expectation  but  might  have  a  lower  sample 
mean  thus  far  due  to  statistical  variation.  Our  idea  Ls  to  incorporate  results  from  this  rich  literature 
into  a  sampling-based  process  for  finding  an  optimal  action  iri  a  state  for  a  single  period  of  an  MDP . 
We  then  extend  the  one-stage  sampling  process  into  multiple  stages  in  a  recursive  manner,  leading 
to  a  multi-stage  (sampling-based)  approximation  algorithm  for  solving  MDPs.  Thus,  by  applying 
the  theory  of  multi-armed  bandit  problems,  we  are  able  to  derive  a  provably  convergent  algorithm 
for  solving  general  finite-horizon  MDPs. 

The  algorithm  adaptively  chooses  which  action  to  sample  as  the  sampling  process  proceeds,  and 
provides  an  asymptotically  unbiased  estimator  with  worst-case  bias  of  O  (Tin  N/N)  and  worst -case 
time-complexity  of  O  ^()*4|7V)^,  which  is  independent  of  the  size  of  the  state  space  but  depends 
on  the  size  of  the  action  space  due  to  the  requirement  that  each  action  be  sampled  at  least  once  at 
each  reached  state. 

Suppose  we  estimate  Qt(x,a)  by  a  sample  mean  Qi(x,  a)  for  each  action  a  £  A,  where 

&(*,«)  =  £((*,  a) +  aT^-f  £  j£U(v),  (16) 

’■  w  ires^ 

where  S*t  is  the  multiset  (which  means  a  set  that  may  include  repeated  members,  Lc*,  the  same 
element  more  than  once)  of  (independently)  sampled  next  states  from  state  x  in  period  i  taking 
action  a.  |5*  J  >  i  (all  actions  from  a  state  must  be  sampled  at  least  once)  and  YlaeA  l^Kil  =  ^ 
(so  that  the  total  number  of  sampled  (next)  states  is  0(NT ),  independent  of  the  state  space  size), 
Ri(x,a)  is  the  sample  mean  for  the  ith  period  reward,  and 

ae/t  J 

is  an  estimate  of  Ji(x) 

J?(x)  = 


.  This  leads  to  the  following  recursion: 

(  i  \ 

ftfool  +  arar;  £  J!Uv) 

\  Pa.il  .  y 


y*  |Sg,tl 


,i  =  0,...,T-  1, 
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with  Jr(x)  ~  0  for  all  x  €  S,  Again,  an  adaptive  scheme  will  specify  a  sequential  selection  of  the 
next  action  to  be  sampled  in  a  given  state  in  a  given  period,  which  eventually  determines 

The  main  idea  behind  our  adaptive  allocation  rule  is  based  on  a  simple  interpretation  of  the 
regret  analysis  of  the  multi-armed  bandit  problem,  where  plays  of  machine  m  yield  iJ.d.  random 
rewards  with  unknown  mean  pm,  and  the  goal  is  to  find/play  the  machine  corresponding  to  the 
maximum  pm.  The  rewards  across  machines  are  also  assumed  to  be  independently  generated.  Let 
Tm(n)  be  the  number  of  times  machine  m  has  been  played  by  an  algorithm  during  the  first  n  plays. 
Define  the  expected  regret  p(n)  of  an  algorithm  after  n  plays  by 

M 

p(n)  =  p*n  -  ^  /rm£[rm(n)j  where  p*  :=  max^, 
m= i 

%vherc  M  is  the  number  of  possible  machines.  Lai  and  Robbins  (1985)  characterized  an  “optimal11 
algorithm  such  that  the  best  machine,  which  is  associated  with  p* „  is  played  exponentially  more 
often  than  any  other  machine,  at  least  asymptotically.  That  is,  they  showed  that  playing  machines 
according  to  an  (asymptotically)  optimal  algorithm  leads  to  p(n)  =  0{lnn)  as  n  oo  under  mild 
assumptions  on  the  reward  distributions.  Unfortunately,  obtaining  an  optimal  algorithm  is  often 
very  difficult,  so  Agrawal  (1995)  derived  a  set  of  simple  algorithms  that  achieve  the  asymptotic 
logarithmic  regret  behavior,  using  a  form  of  upper  confidence  bounds ,  The  temptation  to  play  only 
the  machine  with  the  current  maximum  sample  mean  (exploitation)  is  tempered  by  the  uncer¬ 
tainty  associated  with  estimation,  which  motivates  the  need  to  play  other  machines  occasionally 
(exploration).  Let  /jm(n)  denote  the  machine  rn  sample  mean,  averaged  over  the  number  of  plays 
of  that  machine,  usually  different  from  ii,  which  denotes  the  total  (over  all  machines)  number  of 
plays  so  far.  To  account  for  the  randomness  in  the  estimation,  wc  find  a  function  a m(n)  such  that 
Pm(n)  -  <7 m (n)  <  Pjn  <  pm(n)  +  crm(n)  with  high  probability,  where  £m(n)  +  #m(n)  is  the  upper 
confidence  bound  that  guides  exploration.  At  each  play,  the  algorithms  choose  the  machine  with 
the  current  highest  upper  confidence  bound. 

For  an  intuitive  description  of  the  allocation  rule,  consider  first  only  a  one-stage  approximation, 
where  wc  assume  for  now  that  we  know  J\{x)  for  all  x  6  S.  Then  to  estimate  Jq(z),  wc  need  to 
estimate  Qo(x,a*),  where  a *  e  argmaxa  Qq(x, a).  The  search  for  a*  corresponds  to  the  search  for 
the  best  machine  in  the  multi-armed  bandit  problem.  We  start  by  sampling  each  possible  action 
once  at  x,  which  gives  a  sample  one- period  reward  and  leads  to  the  sampled  next  state.  We  then 
iterate  (see  Loop  in  Figure  1)  by  sampling  the  action  that  achieves  the  maximum  among  the 
current  estimates  of  Qo{x>a)  plus  its  current  upper  confidence  bound  (see  Equation  (18)),  where 
the  estimate  Qg(x,g)  is  given  by  the  immediate  reward  plus  the  sample  mean  of  J\- values  at  the 
visited  next  states  that  have  been  sampled  so  far  (see  Equation  (1G)).  If  the  sampling  is  done 
appropriately,  \S*{]\/N  should  provide  a  good  estimate  of  the  likelihood  that  action  a  is  optimal  in 
state  x;  for  a*  unique,  we  would  expect  |S^.>0)/iV  — +  1  in  the  limit  as  N  — *  oo.  Therefore,  we  use 
a  weighted  (by  |S^q|/JV)  sum  of  the  currently  estimated  value  of  Qo(x,  a)  over  A  to  approximate 
Jq(x)  (see  Equation  (19)).  Ensuring  that  the  weighted  sum  concentrates  on  a*  (even  if  not  unique) 
as  the  sampling  proceeds  will  ensure  that  in  the  limit  the  estimate  J$  (x)  converges  to  Jb(x). 

Wc  now  provide  a  high-level  description  of  the  adaptive  multi-stage  sampling  (AMS)  algorithm 
given  in  Figure  1,  The  inputs  to  AMS  are  a  state  x  €  <S,  N  >  \A\7  and  period  i,  and  the  output 
is  j/^x).  AMS  itself  is  recursively  called  to  estimate  J(!h{y)i  in  the  Initialization  and  Loop 
subroutines  of  the  algorithm.  The  initial  call  to  AMS  is  done  with  i  =  0  and  initial  state  xq, 
and  every  sampling  is  done  independently  of  the  previous  samplings.  To  help  understand  how  the 
recursive  calls  are  made  sequentially,  Figure  2  graphically  illustrates  the  sequence  of  calls  with  two 
actions,  A  =  and  T  -  3  for  the  Initialization  portion.  The  result  of  this  sampling  scheme, 

as  depicted  in  Figure  2,  resemble  simulated  trees  in  the  same  spirit  as  Broadie  and  Glasserman 
(1997)  use  for  an  American-styie  option  pricing  problem  and  Fu  and  Jin  (2002)  use  in  a  more 
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general  MDP  setting.  However,  both  of  those  works  use  non-adaptive  sampling,  in  the  sense  that  a 
fixed  number  of  samples  for  each  action  is  pre- specified;  furthermore,  all  of  the  simulated  trees  arc 
carried  out  in  their  entirety  prior  io  the  backwards  induction ,  In  our  scheme,  the  sampling  of  actions 
is  adaptive,  and  moreover  the  backwards  induction  is  integrated  recursively  with  the  sampling. 


Adaptive  Multi-stage  Sampling  (AMS) 

*  Input;  a  state  x  €  5,  Ar  >  |A|,  and  period  t  Output:  Jfi(x),  where  Jt(')  =  0. 

*  Initialization:  Sample  each  feasible  action  a  6  A  once  from  state  r  and  set 

$*(«,«)-&(*, a) +  o#i(»),  (17) 


where  y  is  the  sampled  next,  state,  and  set  n  —  |*4|. 
*  Loop:  Sample  each  feasible  action  s.t. 


a* 


€  arg  max 

a£A 


(<2i(*.a)  + 


(18) 


where  |5*  J  is  the  number  of  times  action  a  has  been  sampled  so  far. 
and  n  is  the  overall  number  of  samples  done  so  far  for  this  stage. 

-  Update  $**1  *—  S**  ( U  {y*}i  where  y  is  the  newly  visited  next  state  from  taking  a \ 

-  Update  Qi(a:,a’}  using  the  current  according  to  ( 16). 

-  h  *—  n  +  1. 

-  If  h  =  N,  then  go  to  Exit;  else,  continue  Loop, 

*  Exit:  Return 

(19) 


Figure  1:  Algorithm  incorporating  adaptive  sampling. 


It  is  not  difficult  to  show*  that  the  time- complexity  of  the  AMS  algorithm  is  0((|A|AOr).  In 
contrast,  the  time-complexity  for  backward  induction  is  0(|*A||«S|2T).  Therefore,  the  main  benefit  of 
the  proposed  AMS  algorithm  is  independence  horn  the  state  space  size,  due  to  the  sampling  nature 
of  the  algorithm,  although  this  comes  at  the  expense  of  exponential  (versus  linear,  for  backwards 
induction)  dependence  on  both  the  action  space  and  the  horizon  length.  Thus,  the  algorithm  is 
most  appropriate  for  MDPs  with  large  state  spaces  but  relatively  small  action  spaces.  In  terms  of 
theory,  we  have  the  following  rudimentary  convergence  result: 

Theorem  (Chang,  flu,  Fu,  and  Marcus  2005).  For  any  state  xo, 

lim  E[j(^(x0)]  —  Jq(x0). 

N— »oo 


3-2  Additional  Methods 

An  alternative  adaptive  sampling  algorithm,  called  the  Recursive  Automata  Sampling  Algorithm 
(RASA)  for  control  of  finite  horizon  MDPs  is  presented  in  Chang,  Fu,  Hu,  and  Marcus  (2007).  By 
extending  in  a  recursive  manner  an  algorithm  from  learning  automata  called  the  Pursuit  algorithm, 
RASA  returns  estimates  of  both  the  optimal  action  from  a  given  state  and  the  corresponding 
optimal  value.  For  a  given  initial  state,  we  derive  the  following  probability  bounds  as  a  function  of 
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Figure  2:  Graphical  illustration  of  the  sequence  of  the  recursive  calls  made  in  Initialization  of  the 
AMS  algorithm.  Each  circle  corresponds  to  a  state  and  each  arrow  with  noted  action  signifies  a 
sampling  (and  a  recursive  call).  The  bold- face  number  near  each  arrow  is  the  sequence  number  for 
the  recursive  calls  made.  For  simplicity,  the  entire  Loop  process  is  signified  by  one  call  number. 

the  number  of  samples:  (i)  a  lower  bound  on  the  probability  that  RASA  will  sample  the  optimal 
action;  and  (ii)  an  upper  bound  on  the  probability  that  the  deviation  between  the  true  optimal 
value  and  the  RASA  estimate  exceeds  a  given  error. 

In  recent  work  (Chang,  Fu,  and  Marcus  2006),  we  have  developed  a  sampling-based  algorithm 
for  solving  stochastic  optimization  problems,  based  on  an  algorithm  for  solving  adversarial  multi¬ 
armed  bandit  problems.  We  then  recursively  extend  the  algorithm  for  the  solution  of  finite  horizon 
MDPs  and  analyze  its  performance,  showing  that  an  upper  bound  on  the  expected  bias  approaches 
zero  as  the  sampling  size  per  stage  approaches  infinity,  leading  to  the  convergence  to  the  optimal 
value  of  the  MDP* 

A  methodology  that  utilizes  the  approach  of  updating  a  probability  distribution,  but  in  the 
context  of  solving  MDPs,  has  been  developed  in  Chang,  Fu,  Hu,  and  Marcus  (2006).  A  simulation- 
based  algorithm,  called  Simulated  Annealing  Multiplicative  Weights  (SAMW),  was  proposed  for 
solving  large  finite  horizon  MDPs.  At  each  iteration  of  the  algorithm,  a  probability  distribution  over 
candidate  policies  is  updated  by  a  simple  multiplicative  weight  rule,  and  with  proper  annealing  of  a 
control  parameter,  the  generated  sequence  of  distributions  converges  to  a  distribution  concentrated 
only  on  the  best  policies.  The  algorithm  is  asymptotically  efficient,  in  the  sense  that  for  the  goal 
of  estimating  the  value  of  an  optimal  policy,  a  provably  convergent  finite-time  upper  bound  for  the 
sample  mean  is  obtained, 

4  Population-Based  Evolutionary  Approaches  to  MDPs 

In  this  section,  we  discuss  our  research  on  evolutionary  population- based  algorithms  for  finding 
optimal  (stationary)  policies  infinite  horizon  MDPs.  These  algorithms  are  primarily  intended  for 
problems  with  large  (possibly  uncountable)  action  spaces  where  the  policy  improvement  step  in 
Policy  Iteration  (PI)  becomes  computationally  prohibitive,  and  value  iteration  is  also  impractical. 
In  particular,  for  PI,  maximizing  over  the  entire  action  space  may  require  enumeration  or  random 
search  methods.  The  computational  complexity  of  each  iteration  of  our  algorithms  is  polynomial  in 
the  size  of  the  state  space,  but  unlike  PI  and  Value  Iteration  (VI),  it  is  insensitive  to  the  size  of  the 
action  space,  making  the  algorithms  most  suitable  for  problems  with  relatively  small  state  spaces 
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compared  to  the  size  of  the  action  spaces.  In  the  case  of  uncountable  action  spaces,  our  approach 
avoids  the  need  for  any  discretization;  discretization  can  lead  to  computational  difficulties,  either 
resulting  in  an  action  space  that  is  too  large  or  in  a  solution  that  is  not  accurate  enough. 

The  approach  taken  by  our  algorithms  directly  searches  the  policy  space  to  avoid  carrying  out 
an  optimization  over  the  entire  action  space  at  each  PI  step,  and  resembles  that  of  a  standard 
genetic  algorithm  (GA),  updating  a  population  of  policies  using  appropriate  analogous  operations 
for  the  MDP  setting.  One  key  feature  of  the  algorithms  presented  here  is  the  deterini nation  of  an 
elite  policy  that  is  superior  to  the  performances  of  all  policies  in  the  previous  population.  This 
monotonia ty  property  ensures  that  the  algorithms  converge  with  probability  one  to  a  population 
in  which  the  elite  policy  is  an  optimal  policy. 

In  Chang,  Lee,  Fu  and  Marcus  (2005),  we  proposed  a  novel  algorithm  called  Evolutionary 
Policy  Iteration  (EPI)  for  solving  infinite  horizon  discounted  reward  MDPs.  EPI  inherits  the  spirit 
of  the  policy  iteration  (PI)  algorithm  but  eliminates  the  need  to  maximize  over  the  entire  action 
space  in  the  policy  improvement  step  by  directly  manipulating  policies  via  a  method  called  “policy 
switching”  that  generates  an  improved  policy  from  a  set  of  given  policies,  with  a  computation  time 
on  the  order  of  the  size  of  the  state  space.  EPI  iteratively  generates  a  population  (or  set)  of  policies 
such  that  the  performance  of  the  elite  policy”  for  a  population  is  monotonically  improved  with 
respect  to  a  defined  fitness  function.  Each  iteration  of  the  algorithms  consists  of  two  main  steps: 
generation  of  an  elitist  policy  by  policy  switching,  and  exploration  of  the  policy  space  by  generating 
additional  policies  via  mutation  and  policy  switching.  The  algorithm  converges  to  a  population 
that  contains  an  optimal  policy,  independent  of  the  initial  population 

This  work  is  extended  in  Hu,  Fu,  Ramezani,  and  Marcus  (2006),  where  a  new  randomized  search 
method  called  Evolutionary  Random  Policy  Search  (ERPS)  is  introduced;  ERPS  considerably 
enhances  the  EPI  algorithm  to  allow  it  to  be  more  efficient  for  practical  problems.  The  ERPS 
algorithm  approaches  an  MDP  by  iteratively  dividing  it  into  a  sequence  of  smaller,  random,  sub- 
MDP  problems  based  on  information  obtained  from  random  sampling  of  the  entire  actions  space 
and  local  search,  to  extract  a  convergent  sequence  of  policies  via  solving  these  smaller  problems. 
It  thus  improves  upon  both  the  elitist  policy  determination  and  the  mutation  step  by  solving  a 
sequence  of  sufcnMDP  problems  defined  on  smaller  policy  spaces.  Each  sub-MDP  is  then  solved 
approximately  by  using  a  variant  of  PL  where  an  elite  policy  is  obtained. 

As  in  EPI,  each  iteration  of  ERPS  has  two  main  steps: 

1,  An  elitist  policy  is  generated  by  solving  the  sub-MDP  problem  constructed  in  the  previous 
iteration  using  a  variant  of  the  policy  improvement  technique  called  policy  improvement  with 
cost  swapping  (PICS). 

2.  Based  on  the  elitist  policy,  a  group  of  policies  is  then  obtained  by  using  a  “nearest  neighbor” 
heuristic  and  random  sampling  of  the  entire  action  space,  from  which  a  new  sub-MDP  is 
created  by  restricting  the  original  MDP  problem  (e.g.,  cost  structure,  transition  probabili¬ 
ties)  to  the  current  available  subsets  of  actions.  The  “nearest  neighbor”  heuristic  provides 
a  local  search  mechanism  that  leads  to  rapid  convergence  once  a  policy  is  found  in  a  small 
neighborhood  of  an  optimal  policy. 

Whereas  EPI  treats  policies  as  the  most  essential  elements  in  the  action  optimization  step,  and 
each  "elite”  policy  is  directly  generated  from  a  group  of  policies,  in  ERPS  policies  arc  regarded 
as  intermediate  constructions  from  which  sub-MDP  problems  are  then  constructed  and  solved. 
This  modification  substantially  improves  the  performance  while  maintaining  the  computational 
complexity  at  essentially  the  same  level.  It  is  proved  that  the  sequence  of  elite  policies  converges  to 
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an  epsilon-optimal  policy  with  probability  one,  and  numerical  studies  arc  used  to  compare  ERPS 
to  other  algorithms. 

5  Risk-Sensitive  Control  and  Estimation 

In  Ramezani  and  Marcus  (2005),  wc  have  viewed  the  probability  distribution  of  a  Markov  chain 
as  the  information  state  of  an  additive  optimization  problem.  This  optimization  problem  is  then 
generalized  to  a  product  form  whose  information  state  gives  rise  to  a  generalized  notion  of  probabil¬ 
ity  distribution  for  Markov  chains.  The  evolution  and  the  asymptotic  behavior  of  this  generalized 
or  risk-sensitive  probability  distribution  is  studied,  and  a  conjecture  is  proposed  regarding  the 
asymptotic  periodicity  of  risk-sensitive  probability  and  is  proved  in  the  two  dimensional  case. 

Product  estimators  for  partially  observed  Markov  chains  are  introduced  in  Ramezani,  Marcus, 
and  Fu  (2004),  and  a  notion  of  risk-sensitivity  for  which  the  risk  is  non-uniform  and  state-dependent 
is  defined.  Product  probability  is  introduced  and  studied  in  the  context  of  left-toright  Markov 
chains  for  uniform  and  state- dependent  cases.  It  is  shown  that  the  qualitative  behavior  of  these 
estimators  is  related  to  certain  threshold  properties.  In  Ramezani,  Fu,  and  Marcus  (2005),  wc  con¬ 
sider  the  relationship  between  risk-sensitivity  and  information.  Product  estimators  are  introduced 
as  a  generalization  of  Maximum  A  Posteriori  Probability  (MAP)  estimator  for  Hidden  Markov 
Models,  We  study  the  relationship  between  the  inclusion  of  higher  order  moments,  the  underlying 
dynamics  and  the  availability  of  information.  Asymptotic  periodicity  of  these  estimators  and  the 
relationship  between  risk  and  information  is  studied  via  simulation. 


6  Optimization,  Estimation,  and  Control  in  Graphical  Models  and 
Networked  Systems 

We  made  considerable  progress  in  our  work  on  scalable  algorithms  for  inference  in  graphical  models 
(Chen,  Get  in  and  Wilisky  2005a;  Chen,  Cetin,  and  Willsky  2005b;  Chen,  Wain  wright,  Get  in,  and 
Willsky  2006;  Ihler,  Fisher,  Moses,  and  Willsky  2005;  Ihler,  Fisher,  and  Willsky  2006;  Johnson, 
Malioutov,  and  Willsky  2006).  One  of  the  applications  of  our  methodologies  that  wc  have  explored 
is  that  of  multisensor,  multitarget  data  association,  a  notoriously  complex  problem.  We  have 
now  demonstrated  that  our  new  algorithms  can  yield  remarkably  efficient  solutions  to  optimal  data 
association  problems  that  have  heretofore  been  considered  too  complex  for  practical  solution  {hence 
requiring  the  use  of  heuristics  to  obtain  tractable,  but  suboptimal,  solutions).  In  addition,  with 
an  eye  toward  implementation  in  distributed  sensor  networks,  we  have  developed  a  local,  adaptive 
version  of  these  data  association  algorithms  in  which,  at  each  iteration,  each  node  in  the  network 
can  decide  whether  to  transmit  a  message  to  each  of  its  immediate  neighbors  based  on  whether 
the  potential  new  message  differs  in  a  statistically  significant  manner  from  the  previous  message 
that  was  sent  to  that  neighbor.  We  have  shown  that  this  locally  adaptive  algorithm  can  result 
in  dramatic  reductions  in  computationsand  communications,  if  these  messages  were  indeed  sent 
through  a  sensor  networkwith  minimal  decrease  in  association  performance, 

Wc  have  developed  a  new  approach  to  inference  for  graphical  models  that  involve  non- Gaussian 
densities-problems  of  particular  importance  for  various  sensing  modalities  that  provide  measure¬ 
ments  of  either  bearing  or  range.  These  methods,  which  involve  the  use  of  methods  for  nonpara- 
metric  density  estimation  (for  which  reason  we  refer  to  them  as  Non  parametric  Belief  Propagation 
(NBP)  algorithms),  can  be  viewed  as  extensions  of  concepts  of  particle  filtering  to  inference  on 
graphs-this  extension  is  highly  nontrivial,  especially  for  graphs  with  loops,  as  the  iterative  compu¬ 
tations  and  generation  of  messages  of  belief  propagation  require  new  ideas  for  generating  particles 
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to  replace  those  messages.  In  addition  to  developing  the  basic  methodology,  we  have  also  explored 
applications  in  both  computer  vision  and  in  fusion  for  sensor  networks. 

We  have  completed  a  study  of  the  communications  cost  /estimation  accuracy  tradeoff  for  particle- 
based  representations  such  as  those  used  in  NBP.  This  work  provides  a  systematic  approach  to  fully 
adaptive  algorithms  that  directly  tradeoff  accuracy  in  the  transmitted  particle-based  message  for 
the  total  communications  requirement  associated  with  that  message.  Combining  this  with  our  work 
on  relating  errors  between  exact  and  transmitted  messages  and  overall  estimation  accuracy,  this 
IS  now  the  first  available  methodology  for  directly  trading  off  overall  network  estimation  accuracy 
versus  communications  requirements. 

We  have  also  made  considerable  advances  in  understanding  inference  for  Gaussian  graphical 
models,  developing  both  very  powerful,  scalable,  and  accurate  methods  for  covariance  calculation 
for  very  large  problems  and  also  developing  a  new  framework  for  analyzing  and  understanding 
distributed  message- passing  algorithms  (based  on  the  idea  of  so-called  walk-sums)  that  provide 
easily  computable  sufficient  conditions,  as  well  as  complete  necessary  and  sufficient  conditions  for 
convergence  of  the  well-known  Belief  Propagation  (BP)  algorithm.  This  perspective  also  suggests 
ways  in  which  to  achieve  better  performance  than  BP  through  more  effective  exploitation  of  local 
memory  and  computation  in  a  distributed  fusion  system. 

7  Additional  Research  Progress 

We  have  also  made  significant  progress  in  the  following  areas: 

•  Optimal  allocation  of  simulation  budget  in  simulation- based  optimization  (Chen  et,  ah  2004a, 
Fu  et.  al.  2004,  Fu  et.  al.  2006); 

#  New  results  in  zero-sum  Markov  games  (Chang  and  Fu  2004); 

*  Applications  in  inventory  control,  telecommunications,  preventive  maintenance  and  produc¬ 
tion  control  (Zhang  and  Fu  2005,  Chen  et.  ah  2004b*  Ridley  et.  al.  2004 T  Yao  et.  al.  2006). 


8  Research  Output 

8.1  Journal  Publications 

•  H.3.  Chang,  M.C.  Fu,  J.  Hu,  and  SJ.  Marcus,  11  An  Adaptive  Sampling  Algorithm  for  Solving 
Markov  Decision  Processes,"  Operations  Research,  53,  January- February  2005,  126-139. 

•  A.T.  Ihler,  J.W.  Fisher,  III,  R.L,  Moses,  and  A.S.  Willsky,  “Nonparametric  Belief  Propagation 
for  Self- Localization  of  Sensor  Networks,”  IEEE  J,  on  Select  Areas  in  Communication,  special 
issue  on  Distributed  Collaborative  Sensor  Networks,  23,  April  2005,  809-819. 

•  V.  Ramezani  and  S.L  Marcus,  “Risk  Sensitive  Probability  for  Markov  Chains/’  Systems  and 
Control  Letters,  54*  May  2005*  493-502. 

•  X.  Yao,  X.  Xie,  M.C.  Fu*  and  S.L  Marcus,  “Optimal  Joint  Preventive  Maintenance  and 
Production  Policies,”  Naval  Research  Logistics,  52,  October  2005*  668-681. 

•  H.S.  Chang*  H.G,  Lee,  M.C.  Fu,  and  S.I.  Marcus*  “Evolutionary  Policy  Iteration  for  Solving 
Markov  Decision  Processes,”  IEEE  Transactions  on  Automatic  Control*  50*  November  2005* 
1804-1808. 
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•  S.B.  Laprise,  M.C.  Fu,  S.I.  Marcus,  and  A.E.B.  Lim,  “Pricing  American- Style  Derivatives 
with  European  Call  Options,”  Management  Science,  52,  January  2000,  95-110. 

•  L.  Chen,  MJ.  Wainwright,  M,  Cetin,  and  A.S.  Willsky,  “Data  Association  Based  on  Opti¬ 
mization  in  Graphical  Models  with  Application  to  Sensor  Networks,”  invited  paper  in  special 
issue  of  Mathematical  and  Computer  Modeling  on  Optimization  and  Control  for  Military 
Applications,  43,  May  2006,  1114-1135. 

«  C.H,  Chen,  D.  He,  and  M,  Fu,  “Efficient  Dynamic  Simulation  Allocation  in  Ordinal  Opti¬ 
mization,"  JEEE  Transactions  on  Automatic  Control,  51,  December  2006,  2005-2009. 

•  M.C.  Fu,  J.Q.  Hu,  C.H,  Chen,  and  X.  Xiong,  "Simulation  Allocation  for  Determining  the 
Best  Design  in  the  Presence  of  Correlated  Sampling,”  INFORMS  Journal  on  Computing,  19, 
January  2007. 

•  H.S.  Chang,  M.C.  Fu,  J.Q.  Hu,  and  S.L  Marcus,  “An  Asymptotically  Efficient  Simulation- 
Based  Algorithm  for  Finite  Horizon  Stochastic  Dynamic  Programming,”  IEEE  Transactions 
on  Automatic  Control,  52,  January  2007,  89-94. 

•  J.  Hu,  M.C.  Fu,  V.  Ramczani,  and  S.L  Marcus,  “An  Evolutionary  Random  Policy  Search 
Algorithm  for  Solving  Markov  Decision  Processes,”  to  appear  in  INFORMS  Journal  on  Com¬ 
puting. 

•  J.  Hu,  M.C.  Fu,  and  S  J.  Marcus,  “A  Model  Reference  Adaptive  Search  Algorithm  for  Global 
Optimization,"  to  appear  in  Operations  Research. 

•  J.K.  Johnson,  D  M.  Malioutov,  and  A  S.  Willsky,  “Walk-Sums  and  Belief  Propagation  in 
Gaussian  Graphical  Models,”  to  appear  in  J.  Machine  Learning  Research. 

•  H.S.  Chang,  M.C.  Fu,  J.  Hu,  and  S.L  Marcus,  “A  Survey  of  Some  Simulation- based  Methods 
in  Markov  Decision  Processes,”  to  appear  in  Communications  in  Information  and  Systems, 
7,  2007. 

•  A.  Rawat,  H.  La,  M.  Shayman,  and  S.L  Marcus,  “Multicast  Traffic  Grooming  in  Unidirectional 
SONET/WDM  Ring,”  to  appear  in  IEEE  Journal  on  Selected  Areas  in  Communications, 
August  2007. 

•  H.S.  Chang,  M.C.  Fu,  J.  Hu,  and  S.L  Marcus,  “Recursive  Learning  Automata  Approach  to 
Markov  Decision  Processes,”  to  appear  in  IEEE  Transactions  on  Automatic  Control. 

8.2  Refereed  Proceedings  or  Book  Chapters 

•  H.  Zhang  and  M.C.  Fu,  “Sample  Path  Derivatives  for  (s,  S)  Inventory  Systems  with  Price 
Determination,”  in  The  Next  Wave  in  Computing,  Optimization,  and  Decision  Technologies, 
Bruce  Golden,  S.  Raghavan,  Edward  A.  Wasil,  editors,  Kluwer  Academic  Publishers,  229-246, 

2005. 

•  M.C.  Fu,  “Gradient  Estimation,”  Chapter  19  in  Handbooks  in  Operations  Research  and 
Management  Science:  Simulation,  S.G.  Henderson  and  B,L.  Nelson,  eds,  Elsevier,  575-616, 

2006. 
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•  M.C.  Fu,  “Sensitivity  Analysis  for  Simulation  of  Stochastic  Activity  Networks,”  in  Perspec¬ 
tives  in  Operations  Research:  Papers  in  Honor  of  Saul  Gass1  80th  Birthday,  F.B.  Alt,  M.C. 
Fu  and  B.L,  Golden,  editors,  Springer,  351-366,  2006* 

•  M.C.  Fu,  “Variance-Gamma  and  Monte  Carlo,11  in  Advances  in  Mathematical  Finance,  M.C. 
Fu,  R.A.  Jarrow,  J.-Y,  Yen,  and  R,  X  Elliott,  editors,  Birkhauser,  2007. 

•  C.-IL  Chen,  D.  He,  and  M.C.  Fu,  “A  Case  Study  for  Optimal  Dynamic  Simulation  Allocation 
in  Optimal  Optimization,"  Proc.  American  Control  Conference,  2004,  5754-5759. 

•  V.  Ramezani,  SX  Marcus,  and  M.C.  Fu,  “Structured  Risk-Sensitivity  for  Partially  Observed 
Markov  Chains,”  Proc.  43rd  IEEE  Conference  on  Decision  and  Control,  December  2004, 
3473-3478. 

•  H.S.  Chang  and  M.C.  Fu,  “Localization  for  a  Class  of  Two* Team  Zero-Sum  Markov  Games," 
Proc.  43rd  IEEE  Conference  on  Decision  and  Control,  December  2004,  4844-4849, 

•  M.  Chen,  XQ.  Hu,  and  M.C.  Fu,  “Fluid  Approximation  and  Perturbation  Analysis  of  a  Dy¬ 
namic  Priority  Call  Center,”  Proc.  43rd  IEEE  Conference  on  Decision  and  Control,  December 
2004,  2304-2309, 

»  V.  Ramezani,  M.C,  Fu,  and  SX  Marcus,  “Risk  and  Information  in  the  Estimation  of  Hidden 
Markov  Models,”  Proc.  2004  Winter  Simulation  Conference,  December  2004,  1596-1601. 

•  M.C.  Fu,  XQ.  Hu,  C.H.  Chen,  and  X.  Xiong,  “Optimal  Computing  Budget  Allocation  Under 
Correlated  Sampling,”  Proc.  2004  Winter  Simulation  Conference,  December  2004,  595-603. 

•  P.  Fard,  R.  J.  La,  K  Lee,  S.  I.  Marcus,  and  M.  Shayman,  “Reconfiguration  of  MPLS/WDM 
Networks  Using  Simulation-Based  Markov  Decision  Processes,”  Proc.  39th  Annual  Confer¬ 
ence  on  Information  Sciences  and  Systems,  Baltimore,  MD,  February  2005. 

•  L.  Chen,  M.  Get  in,  and  A.S.  Whisky,  “Graphical  Model-Based  Algorithms  for  Data  Asso¬ 
ciation  in  Distributed  Sensing,”  Adaptive  Sensor  Array  Processing  Workshop,  MIT  Lincoln 
Laboratory,  June  7-8,  2005, 

•  M.C.  Fut  J.Hu,  and  SX  Marcus,  “Population- Based  Evolutionary  Approaches  for  Solving 
Markov  Decision  Processes,”  Proc.  2005  IFORS  Conference,  July  1X15,  2005,  Honolulu, 
Hawaii. 

•  L.  Chen,  M.  Cetin,  and  A.S.  WilLsky,  “Distributed  Data  Association  for  Multi- Target  Tracking 
in  Sensor  Networks,”  Inti,  Conf.  On  Information  Fusion,  July  2005;  Best  Student  Paper 
Award, 

•  C.  Panayiotou,  W.C.  Howell,  and  M.C.  Fu,  “Online  Traffic  Light  Control  Through  Gradient 
Estimation  Using  Stochastic  Fluid  Models/1  Proc,  IFAC  Triennial  World  Congress,  2005. 

•  M.  C.  Fu,  “Sensitivity  Analysis  for  Stochastic  Activity  Networks,”  Proc.  International  Conf. 
on  Automatic  Control  and  Systems  Engineering,  2005. 

•  Y.  He,  M.  C,  Fu,  and  S.  I.  Marcus,  “A  Two-Timescale  Simulation- Based  Gradient  Algorithm 
for  Weighted  Cost  Markov  Decision  Proceases,”  Proc.  44th  IEEE  Conference  on  Decision  and 
Control,  December  2005,  8022-8027. 
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•  H.  S.  Chang,  M,  C.  Fu,  and  S.  L  Marcus,  “Recursive  Learning  Automata  for  Control  of 
Partially  Observable  Markov  Decision  Processes,”  Proc*  44th  IEEE  Conference  on  Decision 
and  Control,  December  2005,  6091-6096. 

•  R.  L,  Bennett,  M.  C.  Fu,  R.  Jarrow,  D.A.  NuxolL,  and  ti  Zhang,  “A  Loss  Default  Simula¬ 
tion  Model  of  the  Federal  Bank  Deposit  Insurance  Funds,”  Proc.  2005  Winter  Simulation 
Conference,  December  2005,  1835-1843. 

•  M.  C.  Fu,  F.W.  Glover,  and  J.  April,  “Simulation  Optimization:  A  Review,  New  Develop¬ 
ments,  and  Applications,”  Proc,  2005  Winter  Simulation  Conference,  December  2005,  83-95* 

•  J.  Hu,  M.  C.  Fu,  and  S.  L  Marcus,  “Stochastic  Optimization  using  Model  Reference  Adaptive 
Search,”  Proc.  2005  Winter  Simulation  Conference,  December  2005,  811-818. 

•  J,K.  Johnson,  D.M*  Malioutov,  and  A.S-  Willsky,  “Low- Rank  Variance  Estimation  in  Large- 
Scale  GMRF  Models,  ICASSP  2006,  Toulouse,  France;  winner  Outstanding  Student  Paper 
Award. 

•  M.C.  Fu,  J,  Hu,  anil  S*L  Marcus,  “Model-Based  Randomized  Methods  for  Global  Opti¬ 
mization,”  Proc.  17th  International  Symposium  on  Mathematical  Theory  of  Networks  and 
Systems,  Kyoto,  Japan,  July  24-28,  2006,  355-363. 

•  H.  Zhang  and  M*  C.  Fu,  “Applying  Model  Reference  Adaptive  Search  to  Americau-Style 
Option  Pricing,”  Proc.  2006  Winter  Simulation  Conference,  December  2006,  711-718. 

»  S*  Chang,  M.  C.  Fu,  and  S.  I.  Marcus,  “Adversarial  Multi- Armed  Bandit  Approach  to 
Stochastic  Optimization,”  Proc.  45th  IEEE  Conference  on  Decision  and  Control,  December 
2006,  5681-5686. 

•  Y*  Xin,  M.  Shayman,  R.J.  La,  and  SJ.  Marcus,  “Reconfiguration  of  Survivable  MPSL/WDM 
Networks,”  Proc*  IEEE  GLOBECOM  ,  San  Francisco,  CA,  Nov*  27-Dec.  I,  2006* 

•  M.  C,  Fu  and  W.C*  Howell,  “Traffic  Light  Signal  Optimization  via  Simulation,”  Proc.  In¬ 
ternational  Modeling  and  Simulation  Multiconference:  AI,  Simulation  and  Planning  in  High 
Autonomy  Systems  (AIS)  and  Conceptual  Modeling  and  Simulation  (CMS),  2007. 

8.3  Authored  Books  or  Monographs 

•  H.S.  Chang,  M.C.  Fu,  J.  Hu,  and  S.L  Marcus,  Simulation- based  Algorithms  for  Markov 
Decision  Processes,  Springer- Verlag,  2007  (research  monograph). 

8.4  Edited  Volumes 

•  F.B.  Alt,  M.C*  Fu  and  B.L*  Golden,  editors,  Perspectives  in  Operations  Research:  Papers  in 
Honor  of  Saul  Gass’  80th  Birthday,  Springer- Verlag,  2006. 

•  M.C.  Fu,  R*A.  Jarrow,  J.-Y,  Yen,  and  R.  J*  Elliott,  editors,  Advances  in  Mathematical 
Finance,  Birkhauser,  2007. 
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8.5  Awards 


•  Alan  Willsky;  2004  IEEE  Donald  G.  Fink  Prize  Paper  Award 

•  Alan  Willsky:  Univ.  de  Rennes,  Doctorate  Honoris  Causa,  2005 

•  Best  Student  Paper  Award:  L.  Chen,  M.  Cetin,  and  A.S.  Willsky,  Distributed  Data  Associa¬ 
tion  for  Multi- Target  Tracking  in  Sensor  Networks,  IntL  Conf,  On  Information  Fusion,  July 
2005. 

•  Outstanding  Student  Paper  Award:  J.K.  Johnson,  D.M.  Malioutov,  and  A.S.  Willsky,  Low- 
Rank  Variance  Estimation  in  Large-Scale  GMRF  Models,  ICASSP  2006,  Toulouse*  France 

8.6  Ph.D.  Students 

»  Martin  Wainwright,  Ph  D,  2005,  MIT,  supervised  by  A.  Willsky,  currently  Assistant  Professor 
in  EECS  and  Statistics,  Univ,  of  California,  Berkeley 

•  Alex  Ihler,  Ph.D,  2005,  MIT,  supervised  by  A.  Willsky,  currently  Assistant  Professor,  Toyota 
Technical  Institute,  Chicago 

•  Jiaqiao  Hu,  Ph.D,  2006,  Univ.  of  Maryland,  supervised  by  S.  Marcus  and  M.  Fu;  currently 
an  Assistant  Professor  in  Applied  Mathematics  and  Statistics  at  SUNY  Stony  Brook 

•  Enlu  Zhou,  Ph.D  expected  2008,  Univ.  of  Maryland,  supervised  by  S.  Marcus  and  M.  Fu 

•  Yongqiang  Wang,  Ph.D  expected  2010,  Univ.  of  Maryland,  supervised  by  S.  Marcus  and  M. 
Fu 

•  Lei  Chen,  Ph.D  expected  May  2007,  MIT,  supervised  by  A.  Willsky 

•  Dmitry  Malioutov,  Ph.D  expected  2008,  MIT,  supervised  by  A.  Willsky 

8.7  Postdocs 

•  Vahid  Ramezani,  2004-2005,  Univ,  of  Maryland,  supervised  by  S.  Marcus  and  M.  Fu;  cur¬ 
rently  at  IAI 

•  Ying  He,  2004-2005,  Univ.  of  Maryland,  supervised  by  S.  Marcus  and  M.  Fu;  currently  at 
NIH 
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